Scalable Ordinal Embedding to Model Text Similarity

Author

  • Jesse Anderton
Abstract

Practitioners of machine learning and related fields commonly seek embeddings of object collections into some Euclidean space. These embeddings are useful for dimensionality reduction, for data visualization, as concrete representations of abstract notions of similarity for similarity search, or as features for a downstream learning task such as web search or sentiment analysis. A wide array of such techniques exists, ranging from traditional (PCA, MDS) to trendy (word2vec, deep learning). While most existing techniques rely on preserving some type of exact numeric data (feature values, or estimates of various statistics), I propose to develop and apply large-scale techniques for embedding and similarity search using purely ordinal data (e.g., “object a is more similar to b than to c”). Recent theoretical advances show that ordinal data does not inherently lose information: when carefully applied to an appropriate dataset, the embedding that satisfies the ordinal constraints is unique up to similarity transforms (scaling, translation, reflection, and rotation). Further, ordinality is often a more natural way to express the common goal of finding an embedding that preserves some notion of similarity without taking noisy statistical estimates too literally. The work I propose focuses on three tasks: selecting the minimal ordinal data needed to produce a high-quality embedding, embedding large-scale datasets of high dimensionality, and developing ordinal embeddings that depend on contextual features for, e.g., recommender systems.
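As a rough illustration of the ideas in the abstract, the sketch below builds triplets of the form “a is closer to b than to c” from a small set of 2-D points, checks that a similarity transform leaves every triplet satisfied, and fits coordinates from the triplets alone. It is a minimal sketch and not the method proposed in the thesis: the function names (`triplets_from_points`, `fraction_satisfied`, `similarity_transform`, `embed`), the hinge loss on squared distances, and all hyperparameters are assumptions chosen for illustration.

```python
import math
import random


def triplets_from_points(points):
    """Return every triplet (a, b, c) meaning 'a is closer to b than to c'."""
    n = len(points)
    return [
        (a, b, c)
        for a in range(n)
        for b in range(n)
        for c in range(n)
        if len({a, b, c}) == 3
        and math.dist(points[a], points[b]) < math.dist(points[a], points[c])
    ]


def fraction_satisfied(points, triplets):
    """Fraction of triplets (a, b, c) with dist(a, b) < dist(a, c) in `points`."""
    ok = sum(
        math.dist(points[a], points[b]) < math.dist(points[a], points[c])
        for a, b, c in triplets
    )
    return ok / len(triplets)


def similarity_transform(points, scale, angle, shift, reflect=False):
    """Apply an optional reflection, then rotation, scaling, and translation (2-D)."""
    cos_t, sin_t = math.cos(angle), math.sin(angle)
    out = []
    for x, y in points:
        if reflect:
            x = -x
        out.append((scale * (cos_t * x - sin_t * y) + shift[0],
                    scale * (sin_t * x + cos_t * y) + shift[1]))
    return out


def embed(n, triplets, dim=2, epochs=200, lr=0.05, margin=0.1, seed=0):
    """Fit coordinates by SGD on a hinge loss over squared distances (assumed loss)."""
    rng = random.Random(seed)
    X = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n)]
    order = list(triplets)
    for _ in range(epochs):
        rng.shuffle(order)
        for a, b, c in order:
            d_ab = sum((X[a][k] - X[b][k]) ** 2 for k in range(dim))
            d_ac = sum((X[a][k] - X[c][k]) ** 2 for k in range(dim))
            if d_ab + margin > d_ac:  # constraint violated or inside the margin
                for k in range(dim):
                    xa, xb, xc = X[a][k], X[b][k], X[c][k]
                    # Gradient step on (d_ab - d_ac + margin) for each point.
                    X[a][k] -= lr * 2.0 * (xc - xb)
                    X[b][k] -= lr * 2.0 * (xb - xa)
                    X[c][k] -= lr * 2.0 * (xa - xc)
    return X


rng = random.Random(0)
truth = [(rng.random(), rng.random()) for _ in range(8)]
triplets = triplets_from_points(truth)

# Ordinal constraints are unchanged by scaling, rotation, reflection, translation.
moved = similarity_transform(truth, scale=3.0, angle=0.7, shift=(5.0, -2.0), reflect=True)
print("satisfied after similarity transform:", fraction_satisfied(moved, triplets))

# Coordinates recovered from the ordinal data alone (no numeric distances used).
learned = embed(len(truth), triplets)
print("satisfied by learned embedding:", fraction_satisfied(learned, triplets))
```

The sketch enumerates all triplets, which grows cubically with the number of objects; choosing a much smaller informative subset is the first of the three tasks the abstract names, and this example does not attempt it.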


Similar resources

Intelligent scalable image watermarking robust against progressive DWT-based compression using genetic algorithms

Image watermarking refers to the process of embedding an authentication message, called a watermark, into the host image to uniquely identify ownership. In this paper a novel, intelligent, scalable, robust wavelet-based watermarking approach is proposed. The proposed approach employs a genetic algorithm to find nearly optimal positions to insert the watermark. The embedding positions coded as chr...


Point Localization and Density Estimation from Ordinal Knn Graphs Using Synchronization

We consider the problem of embedding unweighted, directed k-nearest neighbor graphs in low-dimensional Euclidean space. The k-nearest neighbors of each vertex provide ordinal information on the distances between points, but not the distances themselves. Relying only on such ordinal information, along with the low-dimensionality, we recover the coordinates of the points up to arbitrary similarit...
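To make the notion of ordinal information in a k-nearest neighbor graph concrete, the following sketch builds a directed, unweighted kNN graph and lists the comparisons it implies: each vertex is at least as close to any of its k neighbors as to any non-neighbor. The helper names `knn_graph` and `implied_triplets` are hypothetical, and the sketch only extracts the ordinal constraints; it does not implement the synchronization-based recovery the paper describes.

```python
import math
import random


def knn_graph(points, k):
    """Directed, unweighted kNN graph: vertex -> set of its k nearest other vertices."""
    graph = {}
    for i, p in enumerate(points):
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )
        graph[i] = set(others[:k])
    return graph


def implied_triplets(graph):
    """Triplets (i, j, l): i is at least as close to neighbor j as to non-neighbor l."""
    nodes = set(graph)
    return [
        (i, j, l)
        for i, neighbors in graph.items()
        for j in neighbors
        for l in nodes - neighbors - {i}
    ]


rng = random.Random(1)
points = [(rng.random(), rng.random()) for _ in range(10)]
graph = knn_graph(points, k=3)
constraints = implied_triplets(graph)
print(len(constraints), "ordinal constraints from", len(graph), "vertices")
```

Note that the graph carries no distances at all: the constraints come purely from which edges are present and which are absent.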


Link Prediction using Network Embedding based on Global Similarity

Background: Link prediction is one of the most widely studied problems in complex network analysis. Link prediction requires knowing the history of previous link connections and combining it with available information. Local link prediction approaches based on node structure are fast but not accurate enough. On the other hand, the global link predicti...


A Joint Semantic Vector Representation Model for Text Clustering and Classification

Text clustering and classification are two main tasks of text mining. Feature selection plays a key role in the quality of the clustering and classification results. Although word-based features such as term frequency-inverse document frequency (TF-IDF) vectors have been widely used in different applications, their shortcoming in capturing the semantic concepts of text has motivated researchers to use...


Uniqueness of Ordinal Embedding

Ordinal embedding refers to the following problem: all we know about an unknown set of points x_1, ..., x_n ∈ R^d are ordinal constraints of the form ‖x_i − x_j‖ < ‖x_k − x_l‖; the task is to construct a realization y_1, ..., y_n ∈ R^d that preserves these ordinal constraints. It has been conjectured since the 1960s that upon knowledge of all ordinal constraints a large but finite set of points can be ...




Publication date: 2017